QUANTA: X-ray Structure and Analysis

B Creating a Fragment Database

Use the following procedure to create a database to be used by the Search Fragment Database utility. 

1. From the Brookhaven database, select a set of protein coordinate files that have good
resolution and include different structure types.

2. Construct a file (dmlist) that contains a list of these protein coordinate files. Use the 
following format in constructing the file: 

# Number of proteins to be used. 

# Name of coordinate file 1.

# Name of coordinate file 2. 



# Name of coordinate file n. 

3. Run the program $HYD_MSF/dmprep. The program prompts for the name of the file 
(dmlist) containing the list of proteins and asks for a name for the distance matrix file 
(dmfile.new) to be created. The program then reads each protein coordinate file and
constructs a distance matrix file. It also creates a QUANTA input command file. The 
command file is used from within QUANTA to generate an MSF for each of the protein 
coordinate files. You are prompted to name this file. 

The dmprep executable distributed with QUANTA can handle up to 2,000 proteins, with limits of
2,000 residues and 100,000 Cα distances per protein. The FORTRAN sources for dmprep
(dmprep.f and dmsubs.f) are also distributed, so you can increase these dimensions
as you need them.

4. Move the distance matrix file to the $QNT_ROOT/dmatrix directory and rename it to 
dmfile. Because the variable $HYD_DMF is already defined in the QUANTA environment 
as $QNT_ROOT/dmatrix/dmfile, you can do this easily by typing: 



cp dmfile.new $HYD_DMF



where dmfile.new is the filename of the distance matrix file created in step 3.

5. To create the required MSFs, start QUANTA and type @command_file, where
command_file is the name given to the QUANTA command file. Respond appropriately to
the dialog boxes: treat the sixth character in the atom field as a disorder indicator, use the
no-hydrogen dictionary file, and exclude symmetry in the molecular structure file.

6. Move the newly created MSFs to the directory $MSF_LIB. 
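Steps 2 and 3 above can be sketched in Python as follows. This is only an illustration of the idea, not the dmprep implementation: it assumes that each "#" placeholder in the dmlist format stands for one literal value per line, that the coordinate files use fixed-column PDB ATOM records, and that the distance matrix is built from alpha-carbon (CA) atoms. The helper names `format_dmlist`, `read_ca_coords`, and `ca_distance_matrix` are hypothetical.

```python
import math

# Limits quoted in the text for the distributed dmprep executable.
MAX_PROTEINS = 2000
MAX_RESIDUES = 2000


def format_dmlist(coordinate_files):
    """Return dmlist text: the protein count, then one file name per line (assumed layout)."""
    names = [str(name) for name in coordinate_files]
    if len(names) > MAX_PROTEINS:
        raise ValueError(f"{len(names)} proteins exceeds the limit of {MAX_PROTEINS}")
    return "\n".join([str(len(names)), *names]) + "\n"


def read_ca_coords(pdb_lines):
    """Collect (x, y, z) for every CA atom found in fixed-column PDB ATOM records."""
    coords = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name; columns 31-54 hold x, y, z.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def ca_distance_matrix(coords):
    """Return the full symmetric matrix of pairwise CA-CA distances."""
    if len(coords) > MAX_RESIDUES:
        raise ValueError(f"{len(coords)} residues exceeds the limit of {MAX_RESIDUES}")
    return [[math.dist(a, b) for b in coords] for a in coords]
```

For example, `format_dmlist(["1abc.pdb", "2xyz.pdb"])` (hypothetical file names) produces the text that would be saved as dmlist. The on-disk format of the real distance matrix file is not described in this section, so the sketch stops at the in-memory matrix.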



Last updated January 06, 1999 at 05:54PM PST.

Copyright © 1998, 1999 Molecular Simulations Inc. All rights reserved. 
This web page is still under construction.



jdeline@pacbell.net 



I am a Ph.D. chemist trained in synthetic organic chemistry. I have been 
working for The Clorox Company for about the past twelve years, and I live and 
work in the San Francisco Bay Area of California. 

Over the years I have written a couple of software programs for chemists, which 
I give away for free. 

MacFormula 

MacFormula is a molecular weight calculator and more. Enter a formula and 
(optionally) a mass or molar amount, and MacFormula will calculate the 
molecular weight, % elemental composition, and either a mass or molar amount 
(depending upon your optional input). Great for planning reactions. 
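The calculation described above can be sketched as follows. This is a minimal illustration, not MacFormula's own code: it assumes flat formulas without parentheses or hydrates, and it carries only a small table of standard atomic weights. All function names are hypothetical.

```python
import re

# Standard atomic weights for a few common elements (extend as needed).
ATOMIC_WEIGHT = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "Na": 22.990, "Cl": 35.45,
}


def parse_formula(formula):
    """Parse a flat formula such as 'C6H12O6' into {element: count}."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molecular_weight(formula):
    """Sum atomic weights; raises KeyError for elements missing from the table."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in parse_formula(formula).items())


def percent_composition(formula):
    """Return the mass percentage contributed by each element."""
    mw = molecular_weight(formula)
    return {
        el: 100.0 * ATOMIC_WEIGHT[el] * n / mw
        for el, n in parse_formula(formula).items()
    }


def moles_from_mass(formula, grams):
    """Convert a mass in grams to a molar amount."""
    return grams / molecular_weight(formula)
```

Converting in the other direction (moles to grams) is the same relation rearranged: multiply the molar amount by `molecular_weight(formula)`.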

Comes in both a Macintosh and Windows 95 version. The Windows version is 
called "WinFormula." 

Download MacFormula Download WinFormula 

MF Calc ("Molecular Fragment Calculator") 

MF Calc takes a user-defined mass and calculates all of the elemental
combinations consistent with that mass. The program is very flexible: the
user can control the precision of the mass match, as well as which
elements (and how many of each) are included in the search. It can calculate
an exact formula from a high-resolution mass spec value. Available in both Mac
and Windows versions.
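A brute-force version of this kind of search might look like the sketch below. It is an assumption about the general approach, not MF Calc's actual algorithm: it enumerates element counts for C, H, N, and O only, compares monoisotopic masses against a tolerance, and applies no valence or chemical-plausibility filter. The name `candidate_formulas` and its parameters are hypothetical.

```python
from itertools import product

# Monoisotopic masses of the most abundant isotopes (u).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def candidate_formulas(target_mass, tolerance=0.001,
                       elements=("C", "H", "N", "O"), max_counts=None):
    """List element-count combinations whose mass lies within the tolerance."""
    max_counts = max_counts or {}
    ranges = []
    for el in elements:
        # No element can appear more often than the target mass allows.
        limit = int((target_mass + tolerance) // MONOISOTOPIC[el])
        limit = min(limit, max_counts.get(el, limit))
        ranges.append(range(limit + 1))

    hits = []
    for counts in product(*ranges):
        mass = sum(n * MONOISOTOPIC[el] for el, n in zip(elements, counts))
        if abs(mass - target_mass) <= tolerance:
            hits.append(dict(zip(elements, counts)))
    return hits
```

Tightening `tolerance` corresponds to the user-controlled mass precision, and `max_counts` (for example `{"N": 2}`) corresponds to restricting how many atoms of an element are allowed. Exhaustive enumeration is only practical for small masses; a real tool would likely prune the search.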

Download MF Calc (Mac version) Download MF Calc (Windows version)
Document type: Journal Article
Although many disparate methods have been applied to the problem, the
accuracy of protein structural prediction still remains disappointingly
low, averaging about 65% correct secondary structure assignment. A novel
predictive method is presented here, which attempts to address some of the
shortfalls inherent in representing a protein as a simple text-like
sequence of amino acids, by deriving pattern-matching data from the
predicted physical properties of a protein chain rather than from the
sequence itself. A unique binary encoding algorithm is used to enable the
property profiles to be correlated with known secondary structure, and
hence to predict secondary structures for proteins with unknown
structures. By treating the sequence in this manner, predictive
accuracies averaging over 75% have been achieved.

Descriptors: *Algorithms; *Protein Conformation; Amino Acid Sequence;
Computer Simulation; Databases, Factual; Molecular Sequence Data
Abstract (Basic): WO 200241179 A2

NOVELTY - Two molecular structure data selected from a data set 
are compared to determine molecular fragment data which is then 
stored. The process is repeated in which one of the molecular 
structure data is selected from either the predetermined molecular 
structure data or the determined molecular fragment data, such 
that the resultant data set is stored in a database.

DETAILED DESCRIPTION - INDEPENDENT CLAIMS are also included for: 

(1) molecular fragments and biological target characteristics 
relationship determining method; 

(2) automated predicted biological target characteristic data 
generation method; and 

(3) predicted biological target characteristic data generation
program.

USE - For generating molecular fragments relating to drugs. 

ADVANTAGE - Since the molecular fragments that are actually 
found within the molecules of the data set are determined, time is 
not wasted in considering entities which are not present. The method is 
not limited to any particular type of molecular structure. The
database provides the potential for improved data upon which 
subsequent modeling is performed. 

DESCRIPTION OF DRAWING(S) - The figure shows a flow diagram
explaining the molecular fragment database generation method. 

pp; 31 DwgNo 1/4 

Title Terms: MOLECULAR; FRAGMENT; DATABASE; GENERATE; METHOD; DRUG;
DETERMINE; MOLECULAR; FRAGMENT; FOUND; MOLECULAR; DATA; SET
Abstract (Basic): US 20030138810 A1

NOVELTY - Identifying gene clusters, comprising preparing small- 
and large-insert libraries of DNA fragments from genomic DNA, 
sequencing fragments from the small-insert library, comparing using 
computerized methods to a database of known gene clusters to identify 
fragments with similar sequences, and using fragments identified to 
detect gene clusters from the large-insert library, is new. 

DETAILED DESCRIPTION - Identifying gene clusters, comprising: 

(a) preparing small-insert and large-insert libraries respectively 
of DNA fragments from genomic DNA; 

(b) determining DNA sequence of at least part of some of the 
fragments in the small-insert library to form Gene Sequence Tags 
(GSTs);

(c) comparing, under computer control, GSTs or corresponding
amino acid sequences with sequences in a database containing
genes, gene fragments or DNA/amino acid sequences known to be
part of a gene cluster to identify GSTs with similar structure to a 
database sequence; and 

(d) using an identified GST to detect a DNA fragment from the 
large-insert library containing the GST and a gene cluster. 

An INDEPENDENT CLAIM is also included for a similar method in which 
only a large-insert library is prepared and GSTs are identified from 
the large-insert library. 

USE - The method is useful to identify gene clusters associated 
with a pathogenicity island (i.e. group of genes conferring 
pathogenicity), degradation of a compound or conferring resistance to a
therapeutic drug, especially in cultured/uncultured microorganisms, 
particularly prokaryotes e.g. of genus Nocardia, Streptomyces, 
Stigmatella etc. (claimed). It is useful to detect gene clusters 
involved in biosynthesis of natural products e.g. to identify 
biosynthetic loci associated with particular products, distinguish 
between variations of natural products (e.g. between avilamycin-type 
and everninomycin-type orthomycins) or to identify biosynthetic loci in 
organisms not previously known to produce the product, 
Abstract (Basic): WO 200365247 A2

NOVELTY - Analyzing a biochemical sequence database comprises: 

(a) providing an initial query sequence; 

(b) carrying out an alignment of the query sequence against the
database to establish result sequences which resemble the query
sequence according to a measure of similarity; and

(c) if any result sequences are established and unless a stop 
condition is met, automatically repeating the second and third steps 
using each of the result sequences as a query sequence. 
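The iterative search in steps (a) to (c) can be sketched as a breadth-first expansion. This is an illustration under stated assumptions, not the patented implementation: `difflib.SequenceMatcher` stands in for a real sequence alignment, the similarity `threshold` is arbitrary, and the stop condition is taken to be a cap on the number of queries processed (`max_queries`). All names are hypothetical.

```python
from collections import deque
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Toy similarity measure standing in for a real alignment score."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def recursive_search(initial_query, database, threshold=0.7, max_queries=100):
    """Align the query, then re-use each new result sequence as a further query."""
    found = []                     # result sequences in order of discovery
    seen = {initial_query}         # avoids re-querying the same sequence
    queue = deque([initial_query])
    processed = 0

    while queue and processed < max_queries:   # stop condition (assumed)
        query = queue.popleft()
        processed += 1
        for seq in database:
            if seq not in seen and similar(query, seq, threshold):
                seen.add(seq)
                found.append(seq)
                queue.append(seq)  # each result becomes a later query
    return found
```

Tracking `seen` sequences keeps the loop from cycling between mutually similar entries, which is one simple way to guarantee termination even without the query cap.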

DETAILED DESCRIPTION - INDEPENDENT CLAIMS are also included for: 

(1) a computer program product comprising computer program 
instructions to control a computer to carry out the method; 

(2) a computer readable medium carrying a computer program product; 

(3) an apparatus for analyzing a biochemical sequence database,
which comprises:

(a) a data store holding the database;

(b) an input arranged to provide an initial query sequence; 

(c) an alignment engine arranged to carry out an alignment of a 
query sequence against the database to establish result sequences 
which resemble the query sequence according to a measure of 
similarity; and 

(d) control logic arranged to pass the initial query sequence to 
the alignment engine and to subsequently and iteratively pass 
selected ones of the result sequences to the alignment engine, if any 
result sequences are established and until a stop condition is met; and 

(4) a computer system for carrying out analysis of a biochemical
sequence database, which comprises:

(a) a storage area network adapted to store the database;

(b) alignment nodes, each operable in response to an instruction to
carry out the alignment of a query sequence against at least a
part of the database;

(c) a file server connected to the storage area network and to the
alignment nodes; and

(d) a head node connected to the file server and to each alignment
node and operable to receive an initial query sequence and to instruct
each alignment node to carry out an alignment of the initial query
sequence against the database, and operable to receive result
sequences from the alignment nodes and to instruct each alignment node
to carry out an alignment of a received result sequence against the
database to obtain further result sequences.

USE - The method is used for analyzing a biochemical sequence
database. It is used, e.g., for comparing a sequence or set of
sequences from a vertebrate organism, e.g. a fish (e.g., zebrafish), a
bird, and/or a mammal (e.g. a mouse, rabbit, rat, monkey, or human),
with a database of sequences from an invertebrate organism, e.g.,
an insect (e.g., Drosophila melanogaster) or a nematode, or vice versa.

ADVANTAGE - The provision of at least two different types of 
subnode in a heterogeneous cluster allows cost and performance to be 
balanced as required, and reduces the cost of achieving a given level 
of performance when carrying out a recursive alignment. 

DESCRIPTION OF DRAWING(S) - The figure is a flow diagram
illustrating a recursive alignment method for analyzing a biochemical
sequence database.
Abstract (Basic): WO 200221428 A1

NOVELTY - A computer-readable structure comprising records for
storing different types of data relating to respective proteins, a
parameter field for indicating a selected characteristic of the
corresponding protein, a location field for indicating the relative
location in the organism from which the protein was obtained, and
an abundance field for indicating the relative amount of the protein,
is new.

DETAILED DESCRIPTION - A computer-readable structure, encoded on a
computer-readable medium, comprises records for storing different types
of data relating to respective proteins, a parameter field for
indicating a selected characteristic of the corresponding protein, a
location field for indicating the relative location in the organism
from which the corresponding protein was obtained, and an abundance
field for indicating the relative amount of the corresponding protein
obtained from the location, where each record has at least an
identification field for identifying a corresponding one of the
proteins, is new.

INDEPENDENT CLAIMS are also included for the following: 

(1) a computer program product for extracting selected data
relating to a protein from a database comprising a
computer-readable medium, a user interface module for guiding a user to
generate at least one query to retrieve selected data from the
database, a database search module communicatively coupled to the
user interface module and operable to locate and retrieve the
data in the database that correspond to the query;

(2) determining the proteome of an individual comprising taking a 
protein-containing sample from each of at least 5 tissues from an
individual and determining the presence and relative abundance of at 
least 10 proteins from each of the tissues; 

(3) identifying a protein marker that indicates a condition by
change in abundance comprising determining the abundance of a candidate
protein marker in the same biological samples that have different
selected characteristic(s), accessing a database comprising entries
for providing data relating to proteins including the candidate
protein marker, and comparing the abundance of the candidate
protein marker to the entries in the database;

(4) obtaining proteomic information comprising generating a query 
to retrieve selected data relating to a protein from the computer 
program, locating a record in the protein index database that 
satisfies protein characteristics requested via the query and 
generating an output corresponding to the record; 

(5) identifying component-specific proteins from a database
comprising information relating to a number of proteins comprising:

(a) generating a first list of all proteins indicated in the
database as being located in a specimen of a first selected
component;

(b) generating a second list of all proteins indicated in the
database as being located in a specimen of a second selected
component;

(c) subtracting from the first list all of the proteins common
to both lists; and

(d) repeating steps (b) and (c) for components 3-n, where n is
the total number of components in the database;

(6) creating a polypeptide database comprising:

(a) generating a 2-D separation of polypeptides of two sources; 

(b) generating an electronic image of the 2-D separation of 
polypeptides of the two sources; 

(c) warping one of the electronic images of the 2-D separation of 
polypeptides to the other image; 

(d) analyzing the two 2-D separation of polypeptides of the
sources to determine polypeptide spots common to both tissues;

(e) confirming commonality of at least a portion of the
polypeptide spots common in both the two 2-D separation of
polypeptides;

(f) recording in a database polypeptide spots common to both
tissues as being the same in response to positive confirmation of the
portion of the spots common to both 2D separation of polypeptides;

(g) analyzing polypeptide spots not common to both 2-D 
separations; and 

(h) recording in the database results of the analyzing the 
polypeptide spots not common to both 2-D separations; 

(7) identifying a polypeptide in a sample from an individual of a 
randomly breeding population comprising: 

(a) characterizing the polypeptide by isoelectric point and 
molecular weight; 

(b) identifying tissues of the subject where the polypeptide is 
found to yield distinguishing parameters of the polypeptide 
comprising isoelectric point, molecular weight and tissue 
distribution; 

(c) comparing parameters with distinguishing parameters of 
previously tested polypeptides of a set; and 

(d) determining whether a previously tested polypeptide has the
parameters of the polypeptide; and

(8) a data processing system for determining identity of an element
(N+1) to N elements of a database contained in a storage medium
comprising computer processing mechanism, data storage mechanism, and
mechanism for processing data regarding comparing a parameter of the
(N+1) element with the parameter of the N elements of the database,
where:

(a) the element is a protein or polypeptide;

(b) processing data is repeated at least M times, where each M
parameter is examined at each iteration (where M is at least 3) and
when the (N+1) element does not have M identical parameters of N
element(s), the data storage mechanism adds data of the (N+1) element
and of the M parameters to the database to produce a new database
comprising (N+1) elements;

(c) the database comprises database elements corresponding to
proteins in tissues obtained from a selected organism; and

(d) a difference in abundance of the candidate protein marker
identifies the candidate protein marker as a protein marker for the
condition.
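The list subtraction described in independent claim (5) above, for identifying component-specific proteins, reduces to a set difference repeated over every other component. The sketch below assumes the database can be viewed as a mapping from component name to the set of protein identifiers found there; the function name and the sample identifiers are hypothetical.

```python
def component_specific(proteins_by_component, component):
    """Return proteins found in `component` and in no other component."""
    # Start from the first list: everything located in the chosen component.
    specific = set(proteins_by_component[component])
    # Subtract the proteins shared with each remaining component in turn.
    for other, proteins in proteins_by_component.items():
        if other != component:
            specific -= set(proteins)
    return specific


# Hypothetical data: protein identifiers observed per tissue.
example = {
    "liver": {"P1", "P2", "P3"},
    "kidney": {"P2", "P4"},
    "brain": {"P3", "P5"},
}
liver_only = component_specific(example, "liver")
```

Subtracting each other component's full list gives the same result as subtracting only the proteins common to both lists, since proteins absent from the first list cannot be removed from it.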

USE - For organizing database elements corresponding to proteins 
in tissue obtained from a selected organism, organelle, cell, tissue, 
organ, or population. 

ADVANTAGE - The invention can measure the same protein in 
multiple different tissues. It can also measure the abundance of a 
protein at a particular location.

DESCRIPTION OF DRAWING(S) - The figure is a schematic block diagram
showing the steps that form part of the analysis for comparing 
proteins of different tissues. 
Abstract (Basic): WO 200135316 A2

NOVELTY - A computer-based method (M1) of drug design that uses
three-dimensional (3-D) protein structural models derived from
genetic polymorphisms, is new.

DETAILED DESCRIPTION - A computer-based method (M1) of drug design
that uses three-dimensional (3-D) protein structural models derived
from genetic polymorphisms, is new.

M1 comprises:

(a) obtaining more than one amino acid sequence of target
proteins that are the product of a gene exhibiting genetic
polymorphisms, where the sequences represent different genetic
polymorphisms;

(b) generating 3-D protein structural variant models from the 
sequences; and 

(c) based upon the structures of the 3-D models, designing drug 
candidates, modifying existing drugs, identifying potential drug 
candidates or identifying modifications of existing drugs based on 
predicted intermolecular interactions of the drug candidates or 
modified drugs with the structural variants. 

INDEPENDENT CLAIMS are also included for the following: 
(1) a computer-based method (M2) of selecting drug therapies for
patients based on genetic polymorphisms, comprising:

(a) steps (a) and (b) of M1;

(b) computationally docking drug molecules with the target
protein models;

(c) energetically refining the docked complexes;

(d) determining the binding interactions between the drug or
potential new drug candidate molecules and the models; and

(e) selecting drug therapies based on the drug or drugs that have
the most favorable binding interactions with the structural variant
models;

(2) a computer-based method for predicting clinical responses in 
patients based on genetic polymorphisms, comprising: 

(a) steps (a) and (b) of M1;

(b) building a relational database of protein structural
variants derived based on genetic polymorphisms and observed clinical
data associated with particular polymorphisms exhibited in the
patients, where the database comprises 3-D molecular coordinates
for the structural variant models, a molecular graphics interface
for 3-D molecular structure visualization, computer functionality for
protein sequence and structural analysis, database searching
tools, and observed clinical data associated with the genetic
polymorphisms, subject medical history and subject history associated
with the genetic polymorphisms, obtaining a target protein
structural variant based on the same gene associated with a
polymorphism in a patient;

(c) generating a 3-D protein model based on the subject's gene 
sequence; 

(d) screening/comparing the 3-D model derived from the subject to
the structures contained in the database by identifying structures in 
the database that are similar to the model derived from the subject 
and predicting a clinical outcome for the patient based on the clinical 
data associated with the identified structures; 

(3) a computer-based method for designing therapeutic agents that 
are active against biological targets that have become drug resistant 
due to genetic mutations, comprising obtaining a first 3-D protein 
structural variant model of a target protein against which a given 
drug has biological activity, generating a second 3-D protein 
structural variant model of the target in which genetic mutations 
have occurred and against which the same drug is no longer 
biologically active, comparing the structures of the first and second 
model to identify structural differences, and performing 
structure-based drug design calculations in order to identify new drugs 
or modifications to the existing drug to bring about biological 
activity against the second model; 

(4) a computer-based method for identifying compensatory mutations
in a target protein, comprising obtaining the amino acid sequence
of a target protein containing multiple amino acid mutations that
is expressed in a patient, where the structure of a form of the target
protein that responds to a particular drug, including the active site,
has been structurally characterized, generating a 3-D structural model
of the mutated protein; comparing the structure of the mutated
protein with the form of the protein that responds to the drug to
identify structural differences and/or similarities arising from the
mutations, comparing the biological activities of the drug against
both the mutated protein and the form of the protein that responds
to the drug to determine the effects of the mutations on drug response,
and identifying the mutations in the protein that affect biological
activity based on the comparisons;

(5) a method (M3) for creating a 3-D structural polymorphism
relational database, comprising obtaining one or more amino acid
sequences of a target protein that is the product of a gene
exhibiting a genetic polymorphism, where sequences represent
different genetic polymorphisms, generating 3-D protein structural
variant models from the sequences, energetically refining the models,
evaluating the quality of the models, optionally obtaining associated
clinical properties or data, and inputting the model and any associated
properties and/or data into a relational database;

(6) a database (D1) created by M3;

(7) a computer system, comprising a database containing data 
representative of the three dimensional structure of polymorphic 
variants of a drug target; 

(8) a database (D2) comprising: 

(a) sequences of nucleotides encoding a protein or its portions,
where the protein comprises polymorphic variants and the portions
encode a domain of the protein that comprises a site which binds to a
drug candidate; and

(b) the coordinates of 3-D structures of the encoded
protein or its portions; and

(9) a database (D3) comprising the 115 nucleotide sequences 
defined in the specification that encode HIV protease or a portion of 
HIV reverse transcriptase. 

USE - The computer-based method is useful for designing drug 
candidates, modifying existing drugs, identifying potential drug 
candidates or identifying modifications of existing drugs based on 
predicted intermolecular interactions of the drug candidates or 
modified drugs with the structural variants. The method is also useful 
for understanding and overcoming drug resistance using the 3-D protein 
model structures resulting from multiple genetic polymorphisms or 
mutations in infectious agents e.g. HIV. 
Abstract (Basic): WO 9937816 A1

NOVELTY - A method for identifying a polynucleotide (PN) fragment
of a gene conferring a selected phenotype to a sample cell is new.

DETAILED DESCRIPTION - The method (M1) comprises:

(a) obtaining a set of PNs representing gene expression in 2 or 
more sample cells; 

(b) obtaining a set of PNs representing gene expression in one or 
more control cells; and 

(c) identifying a unique PN representing a gene that is common to 
the 2 or more sample cells and differentially expressed in the sample 
cells compared to the control cell. 

INDEPENDENT CLAIMS are also included for the following: 

(1) a method for identifying one or more PNs corresponding to one 
or more secreted biological factors comprising: 

(a) obtaining a set of PNs representing gene expression in one or 
more sample cells that secrete the factor; 

(b) obtaining a set of PNs representing gene expression in one or 
more control cells that do not secrete the factor; 

(c) identifying one or more unique PNs which are common to the 
sample cells, the unique PNs being absent or expressed at lower levels 
in the control cells; 

(2) a method for identifying a therapeutic target comprising: 

(a) obtaining a set of PNs representing gene expression in 2 or 
more sample cells; 

(b) obtaining a set of PNs representing gene expression in one or 
more control cells; and

(c) identifying a unique PN representing a gene that is common to 
the 2 or more sample cells and differentially expressed in the sample 
cells compared to the control cell; 

(3) a method of creating a database of PN data resulting from 
processing cell samples comprising: 

(a) transferring sequence records that correspond to PNs obtained 
from a sample of cells electronically to a computer processor and 
creating a data raw file containing observed PN abundances related to 
the samples; and 

(b) creating a compare data file by combining the data raw 
file with other data raw files, the other data raw files having been 
created from other samples; where the compare data file contains 
records from the data raw files, the data having been normalized to 
indicate percentage of sample for the number of occurrences of a PN in 
each of samples from the cells; 

(4) a system for identifying selected PN records comprising: 

(a) a digital computer; 

(b) a database coupled to the computer; 

(c) a database coupled to a database server having data stored 
in it, the data comprising records of data combined from PN raw files, 
the data having been normalized to indicate percentage of sample for a 
number of occurrences of a same tag in each sample of the samples; and 

(d) a code mechanism for applying queries based upon a desired 
selection criteria to the data file in the database to produce 
reports of PN records which match the desired selection criteria; 

(5) a method for identifying selected PN records from a database,
using a computer having a processor, memory, display, input/output
devices, comprising:

(a) providing a database coupled to the computer having data
stored in it, the data comprising representations of data combined from
PN raw files, the data having been normalized to indicate percentage of 
sample for a number of occurrences of a same PN in each of the samples; 
and 

(b) using a code mechanism for applying queries based upon a 
desired selection criteria to the data file in the database to 
produce reports of PN records which match the desired selection 
criteria. 
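The normalization mentioned in claims (3) to (5), where raw PN counts are expressed as a percentage of each sample before files are combined, can be sketched as follows. The data layout (a dictionary of tag counts per sample) and the function names are assumptions made for illustration; the patent abstract does not specify a file format.

```python
def normalize_counts(raw_counts):
    """Convert raw PN occurrence counts into percentages of the sample total."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("sample contains no PN observations")
    return {tag: 100.0 * n / total for tag, n in raw_counts.items()}


def build_compare_table(samples):
    """Combine per-sample raw data into one table of normalized abundances."""
    normalized = {name: normalize_counts(counts) for name, counts in samples.items()}
    all_tags = sorted({tag for counts in samples.values() for tag in counts})
    # A tag absent from a sample is recorded as 0 percent of that sample.
    return {
        tag: {name: normalized[name].get(tag, 0.0) for name in samples}
        for tag in all_tags
    }


def select(table, predicate):
    """Apply a selection criterion to each row and report the matching tags."""
    return [tag for tag, row in table.items() if predicate(row)]
```

A query such as `select(table, lambda row: row["tumor"] > 2 * row["normal"])` (with hypothetical sample names) would then report tags that are over-represented in one sample relative to another.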

USE - The methods can be used with sample cells such as neoplastic
cells, drug-resistant neoplastic cells, neoplastic cells which promote
angiogenesis, de-differentiated cells, differentiated cells, apoptotic
cells, hyperproliferative cells, cells infected with a pathogen,
drug-resistant cells infected with a pathogen or plant cells. The
selected phenotype may be associated with e.g. genetic disease,
altered metabolic activity, senescence, apoptosis, drug metabolism or
allergic reaction. Antibodies against proteins encoded by the
identified PNs, immune effectors or antigen presenting cells presenting
the protein, can be used with a cytokine or a co-stimulatory
molecule for the therapy of disorders, e.g. for inducing an immune
response against a polypeptide associated with a neoplastic
phenotype (all claimed).
Main International Patent Class: G06F-017/30
Publication Language: English
Filing Language: English 
Fulltext Availability: 

Detailed Description 

Claims 

Fulltext Word Count: 12501 
English Abstract 

A computer system comprising a database (100) having a plurality of
records is provided. Each record comprises a field point representation
representing field extrema for a conformation of a chemical structure.
The database may include records for multiple conformations of the same
chemical structure. Each record can have a searchable index of the field
point representation. In one embodiment the index is a bit string. An
indexing mechanism for generating an index, a searching mechanism for
searching the database and a graphical user interface to enable a user to
interface with the database (100) are also provided.
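A bit-string index of this kind is commonly searched with a subset screen: a record passes only if every bit set in the search key is also set in the record's bit string. The sketch below shows that generic test using Python integers as bit strings. It is an assumption about how such an index could be screened, not the patent's mechanism, and the record names and bit patterns are hypothetical.

```python
def passes_screen(search_key, record_bits):
    """True when every TRUE bit in the search key is also TRUE in the record."""
    return (search_key & record_bits) == search_key


def screen(search_key, index):
    """Return the names of records whose bit strings pass the screen."""
    return [name for name, bits in index.items() if passes_screen(search_key, bits)]


# Hypothetical index: one bit string per stored conformation.
index = {
    "mol_a": 0b1011,
    "mol_b": 0b0110,
    "mol_c": 0b1111,
}
matches = screen(0b0011, index)
```

A screen like this only rules records out cheaply; records that pass would still need a full comparison of their field point representations.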

French Abstract 

L'invention concerne un systeme informatique comprenant une base de
donnees (100) qui possede une pluralite de fichiers. Chaque fichier
comprend une representation de point de champ representant des extremites
de champ pour une conformation d'une structure chimique. La base de
donnees comprend des fichiers pour des conformations multiples de la meme
structure chimique. Chaque fichier peut presenter un index consultable de
la representation de point de champ. Dans un mode de realisation, l'index
est une chaine de bits. L'invention concerne egalement un mecanisme
d'indexation permettant de generer un index, un mecanisme de recherche
permettant de consulter la base de donnees et une interface graphique
utilisateur permettant a un utilisateur d'interagir avec la base de
donnees (100).

Legal Status (Type, Date, Text) 

Publication 20040318 A1 With international search report.

Publication 20040318 A1 Before the expiration of the time limit for
amending the claims and to be republished in the
event of the receipt of amendments.

Main International Patent Class: G06F-017/30 
International Patent Class: G06F-017/50 ... 

. . . G06F-019/00 

Fulltext Availability: 
Detailed Description 

Detailed Description 

key is generated. As the search proceeds, the search key is compared 
to the bit string of each molecule in the database. If a TRUE bit
in the search key is not also set as TRUE in 
Detailed Description
Claims 

Fulltext Word Count: 26108 
English Abstract 

Provided are methods and systems for identification of proteins using 
high mass accuracy mass spectrometry. Not only do high mass accuracy 
measurements provide greater confidence in protein identification 
assignments, but they also enable proteins to be identified with either 
less sequence coverage or fewer additional tandem MS experiments. In 
addition, high mass measurement accuracy optionally allows protein 
identifications to be made on the basis of the mass of a single peptide, 
providing higher-throughputs in the analysis of mixtures due to the 
significant decrease in time spent on additional tandem MS experiments. 
In addition, a concomitant time saving in the cross correlation process 
of mass spectral data with in silico digested databases would also be 
achieved. 

French Abstract 

L'invention concerne des procedes et des systemes destines a identifier
des proteines a l'aide de spectrometrie de masse elevee, precise. Les
mesures precises de masse elevee permettent une meilleure confiance dans
les attributions d'identification de proteines mais elles permettent
aussi d'identifier des proteines, soit avec une moindre couverture de
sequence, soit avec moins d'experiences supplementaires de spectrometrie
de masse en tandem. En outre, la mesure precise de masse elevee permet,
eventuellement, de realiser des identifications de proteines reposant sur
la masse d'un seul peptide, autorisant une plus grande productivite dans
l'analyse de melanges en raison du raccourcissement du temps passe sur
des experiences supplementaires de spectrometrie de masse en tandem. On
realise aussi une economie de temps concomitante dans le processus de
correlation entre des donnees spectrales de masse et des bases de donnees
de digestion in silico.

Legal Status (Type, Date, Text) 

Publication 20030703 A1 With international search report.

Main International Patent Class: G06F-019/00 
Fulltext Availability: 
Detailed Description 

Detailed Description 

each mass in the list of theoretical masses corresponds to one and 
only one unique peptide sequence) . In this embodiment, correlation of 
an experimental peak with a unique mass from the... 

...The data complexity reduction methods of the present invention can 
optionally be performed in an iterative manner, to further assign the 
unidentified MS peaks based upon information gleaned from the previous 
round of analysis. In this embodiment, after identification of one or 
more parent protein sequences (for example, by correlating an MS peak 
with a unique theoretical mass), the first database of identified 
proteins is regenerated to include the newly identified parent 
protein sequences (e.g., additional member proteins). Additional in 
silico peptide fragments are generated from the information in the 
updated first database , and the corresponding (unique and/or 
non-unique) theoretical masses are again compared to the list of 
mass peaks for the sample, to further reduce the number of unidentified 
MS peaks and to possibly correlate unassigned MS peaks to further 
additional parent proteins. The steps of regenerating the list of 
parent proteins, calculating theoretical masses for component peptides, 
and correlating the list to the remaining unidentified MS peaks is 
optionally repeated until no additional member proteins are 
identified. 
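One round of the assignment procedure described above might look like the following Python sketch. It is an illustration under stated assumptions, not the patented method: the in silico digest is taken as a precomputed mapping from protein name to theoretical peptide masses, matching uses a fixed absolute tolerance, and the names `one_round`, `digests` and `tol` are hypothetical.

```python
def within(a: float, b: float, tol: float) -> bool:
    """True when two masses agree within the absolute tolerance."""
    return abs(a - b) <= tol


def one_round(peaks, digests, identified=(), tol=0.01):
    """Identify proteins from unique masses, then assign peaks they explain.

    digests maps a protein name to its theoretical peptide masses
    (a hypothetical, precomputed in silico digest).
    """
    identified = set(identified)
    # Step 1: a peak matched by peptides of exactly one protein identifies that protein.
    for peak in peaks:
        hits = {
            protein
            for protein, masses in digests.items()
            if any(within(peak, m, tol) for m in masses)
        }
        if len(hits) == 1:
            identified |= hits
    # Step 2: any peak explained by a peptide (unique or not) of an identified
    # protein is treated as assigned; the rest stay unidentified.
    remaining = [
        peak
        for peak in peaks
        if not any(within(peak, m, tol) for p in identified for m in digests[p])
    ]
    return identified, remaining


digests = {
    "P1": [1000.50, 1200.60],
    "P2": [1200.60, 1500.70],
    "P3": [1500.70, 1800.80],
}
peaks = [1000.50, 1200.60, 1500.70]
identified, remaining = one_round(peaks, digests)
```

In this toy data the peak at 1000.50 is unique to P1, so P1 is identified and also absorbs the shared peak at 1200.60. The peak at 1500.70 stays unassigned because two unidentified proteins could explain it.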

[0016] Optionally, the member proteins in the sample (or
proteolytically-cleaved fragments thereof) can be isotopically labeled
prior to generating the mass list, to further assist in...
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Fulltext Availability: 
Detailed Description 
Claims 

Fulltext Word Count: 21640 
English Abstract 

Structural alignment methods are described that compare the sequences of 
two or more structural features of molecules. The methods provide for a 
rigorous statistical analysis that can detect structural similarities in 
molecules regardless of the similarity in their primary sequences. Thus, 
the methods can be used to predict and explain functional properties of 
molecules from their three-dimensional conformation. The methods use 
databases of different structural features against which a query sequence 
can be searched. By combining the search results from the various 
databases, the functional properties of molecules can be predicted and 
serve as a basis for the efficient design of ligands, substrate 
analogues, inhibitors or pharmaceutical species thereof. 

French Abstract 

L'invention concerne des procedes d'alignement de structures consistant a
comparer les sequences de deux ou de plusieurs caracteristiques de
structures de molecules. Les procedes fournissent une analyse statistique
rigoureuse capable de detecter des similitudes de structure dans les
molecules, independamment de leurs sequences primaires. Les procedes
peuvent donc etre utilises pour prevoir et expliquer les proprietes
fonctionnelles de molecules a partir de leur configuration
tridimensionnelle. Les procedes utilisent des bases de donnees de
differentes caracteristiques de structures vis-a-vis desquelles une
sequence d'interrogation peut etre cherchee. En combinant les resultats
des recherches provenant des diverses bases de donnees, les proprietes
fonctionnelles des molecules peuvent etre prevues et servir de base pour
la conception efficace de ligands, d'analogues de substrats,
d'inhibiteurs ou d'especes pharmaceutiques de ceux-ci.
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Publication 20030612 A2 Without international search report and to be 
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Detailed Description 

Detailed Description 
... pocket. This process is
repeated for every pocket and void in the pvSoar database to create a
new database of pocket and void signature of amino acid residue
distributions (pvSoarD). The signature composition distributions can be
compared to each other in any number...
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Detailed Description 

Claims 

Fulltext Word Count: 35010 
English Abstract 

A system for analyzing a vast amount of data representative of chemical 
structure and activity information and concisely providing conclusions 
about structure-to-activity relationships. A computer may adaptively 
learn new substructure descriptors based on its analysis of the input 
data. The computer may then apply each substructure descriptor as a 
filter to establish new groups of molecules that match the descriptor. 
From each new group of molecules, the computer may in turn generate one 
or more additional new groups of molecules. A result of the analysis in 
an exemplary arrangement is a tree structure that reflects pharmacophoric 
information and efficiently establishes through lineage what effect on 
activity various chemical substructures are likely to have. The tree 
structure can then be applied as a multi-domain classifier, to help a 
chemist classify test compounds into structural subclasses. 

French Abstract 

L'invention concerne un systeme permettant d'analyser une grande quantite
de donnees representant des structures chimiques et des informations
d'activite, et donnant des conclusions concises concernant les relations
structure-activite. Un ordinateur peut apprendre de maniere adaptative de
nouveaux descripteurs de sous-structures d'apres son analyse des donnees
entrees. L'ordinateur peut ensuite appliquer chaque descripteur de
sous-structure en tant que filtre en vue d'etablir de nouveaux groupes de
molecules correspondant au descripteur. A terme, l'ordinateur peut, a
partir de chaque nouveau groupe, generer de nouveaux groupes de molecules
supplementaires. Un resultat de l'analyse peut etre, par exemple, une
structure arborescente refletant des informations pharmacophoriques et
etablissant par des lignes les effets que differents produits chimiques
sont susceptibles d'avoir sur l'activite. Ces structures arborescentes
peuvent etre utilisees en tant que classeur multi-domaine, aux fins
d'aider un chimiste a classer des composes test dans des sous-classes
structurelles.
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Detailed Description 

... string.") By way of example and without limitation, a useful system
for representing chemical molecules in ASCII form is also provided by
Daylight Chemical Information Systems, Inc. Daylight establishes a...

...can be used to specify substructures using rules that are
straightforward extensions of SMILES strings. Additional information about
Daylight SMARTS keys is provided at the Daylight web site indicated
above.

According to Daylight, both SMILES and SMARTS strings employ atoms and
bonds as fundamental symbols, which can be used to specify the nodes and
edges of a molecule's graph and assign labels to the components of
the graph. SMARTS strings are interpreted as patterns that can be
matched against SMILES string representations of molecules, in the
form of database queries for instance. Other examples of substructure
representations include "MACCS" keys (i.e., fragment-based keys for use
in describing molecules, where MACCS stands for "the Molecular ACCess
System") and other keys as defined by MDL Information Systems, Inc., for
instance. (For...
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Main International Patent Class: G06F-019/00 
Publication Language: English 
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Detailed Description 

Claims 

Fulltext Word Count: 11078 
English Abstract 

A system and method for providing improved de novo structure based drug
design that include a method for more accurately predicting binding free
energy. The system and method use a coarse graining model with
corresponding knowledge based potential data to grow candidate molecules
or ligands (108). In light of the present invention using the coarse
graining model, the novel growth method (108) of the present invention
uses a Metropolis Monte Carlo selection process (218) which results in a
low energy structure that is not necessarily the lowest energy structure,
yet a better candidate (110) can result.
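The abstract names a Metropolis Monte Carlo selection process without giving details. The standard Metropolis acceptance rule is sketched below in Python as background. How the patent applies it to fragment growth is not stated in this excerpt, so the function name `metropolis_accept`, the energy values and the use of a temperature in energy units (kT) are assumptions made for illustration.

```python
import math
import random


def metropolis_accept(current_energy, candidate_energy, temperature, rng=random):
    """Standard Metropolis criterion.

    Downhill (or equal-energy) moves are always accepted; uphill moves are
    accepted with probability exp(-dE / temperature), where temperature is
    expressed in the same units as the energies (an assumption of this sketch).
    """
    delta = candidate_energy - current_energy
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


# A candidate that lowers the energy is always kept.
always = metropolis_accept(5.0, 3.0, temperature=1.0)
```

Because uphill moves are sometimes accepted, a search using this rule can settle on a low energy structure that is not the global minimum, which is consistent with the behaviour the abstract describes.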

French Abstract 

L'invention porte sur un systeme et un procede de conception rationnelle
des medicaments sur la base d'une structure de novo, amelioree, et
faisant intervenir un procede de prediction precise de l'energie libre de
liaison. Ce systeme et ce procede utilisent un modele de granulation
grossiere avec des donnees potentielles basees sur une connaissance
correspondante de facon a developper des molecules ou ligand candidats
(108). A la lumiere de la presente invention utilisant le modele de
granulation grossiere, le nouveau procede de developpement moleculaire
(108) met en oeuvre une methode de selection Metropolis Monte Carlo (218)
qui donne lieu a une faible structure energetique, qui n'est pas
necessairement la plus faible, mais qui, toutefois, permet d'obtenir un
meilleur candidat (110).

Main International Patent Class: G06F-019/00 
Fulltext Availability: 
Detailed Description 

Detailed Description 

greater detail subsequently. 

It is known to use one of two methods to automatically search databases 
that contain large amounts of data relating to fragments that can be 
used for building molecules or ligands for developing lead 
candidates. A first method is the Geometric method that matches... 
functional groups. HOOK uses random placement of many copies of several 
functional fragments followed by molecular dynamics.

Multiple Start Monte Carlo methods also have been used as
fragment joining methods. These methods conduct searches of databases
for fragments of a ligand to dock at the receptor site.
BUILDER software uses a family of docked structures to provide an
irregular lattice of controllable density...nearly 1 billion candidates
of 5 functional groups 505 combinations. As the size of the database
of molecular fragments increases, it is readily seen that the number
of possible combinations will increase dramatically. As...



24/5,K/13 (Item 13 from file: 349)
DIALOG(R)File 349:PCT FULLTEXT
(c) 2004 WIPO/Univentio. All rts. reserv.

00386816 **Image available**
METHOD OF CREATING AND SEARCHING A MOLECULAR VIRTUAL LIBRARY USING
VALIDATED MOLECULAR STRUCTURE DESCRIPTORS
PROCEDE POUR CREER UNE BIBLIOTHEQUE MOLECULAIRE VIRTUELLE ET PROCEDE POUR Y
FAIRE DES RECHERCHES, EN UTILISANT DES DESCRIPTEURS VALIDES DE
STRUCTURE MOLECULAIRE
Patent Applicant/Assignee: 

PATTERSON David E, 

CRAMER Richard D, 

CLARK Robert D, 

FERGUSON Allan M, 
Inventor (s) : 

PATTERSON David E, 

CRAMER Richard D, 

CLARK Robert D, 

FERGUSON Allan M, 
Patent and Priority Information (Country, Number, Date) : 

Patent: WO 9727559 A1 19970731

Application: WO 97US1491 19970127 (PCT/WO US9701491) 

Priority Application: US 96592132 19960126; US 96657147 19960603 
Designated States: AU CA CN CZ HU IL JP KR NO PL US AT BE CH DE DK ES FI FR 

GB GR IE IT LU MC NL PT SE 
Main International Patent Class: G06F-019/00 
Publication Language: English 
Fulltext Availability: 

Detailed Description 

Claims 
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English Abstract 

The problem of how to select out of a large chemically accessible 
universe molecules representative of the diversity of that universe is 
resolved by the discovery of a method to validate molecular structural 
descriptors. Using the validated descriptors, optimally diverse subsets 
(5) can be selected. In addition, from the universe, molecules with 
characteristics similar to a selected molecule can be identified (3). The
validated descriptors also enable the generation of a huge virtual 
library of potential product molecules which could be formed by 
combinatorial arrangement of structural variations and cores. In this 
virtual library it is possible to search billions of possible product 
compounds in relatively short time frames. 

French Abstract 

Le probleme de la selection de molecules dans l'univers etendu des 
molecules chimiques possibles, dans toute sa diversite, est resolu par la 
decouverte d'un procede permettant de valider des descripteurs de 
structure moleculaire. En utilisant les descripteurs valides, on peut 
selectionner des sous-ensembles (5) diversifies de maniere optimale. En 
plus, on peut identifier (3) dans cet univers des molecules possedant des 
caracteristiques similaires a celles d'une molecule selectionnee. Les
descripteurs valides permettent, egalement, de produire une bibliotheque 
virtuelle immense de molecules potentielles de produits qui peuvent etre 
formees par arrangement combinatoire de differentes structures et noyaux. 
Dans cette bibliotheque virtuelle, il est possible d'effectuer une 
recherche parmi des milliards de composes possibles de produits, en un 
temps relativement court. 
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of just the side chains (as was done with the topomeric CoMFA metric) 
of the molecules for the same 20 data sets. In Table 3 are shown the
Tanimoto fingerprint density ratios for the whole molecule and side
chain Tanimoto metrics and the corresponding χ² values for the 20 data
sets.

TABLE 3

Patterson Plot Ratios and Associated χ²

Col. 1 Col. 2 Col. 3... metric is more sensitive to the volume and shape
of the space occupied by a molecule than is, for instance, either the
side chain or whole molecule Tanimoto descriptor. Figure 12 provides
an illustrative example of this feature drawn from the thiol...
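For readers unfamiliar with the Tanimoto metric mentioned in this excerpt, the usual Tanimoto coefficient between two fingerprints can be sketched in Python as follows. This is the textbook definition on sets of "on" bits, not code from the patent; the function name `tanimoto`, the example bit positions and the convention for two empty fingerprints are assumptions.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits present in either fingerprint."""
    union = bits_a | bits_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints given as the positions of their on-bits.
whole_molecule_a = {1, 2, 3}
whole_molecule_b = {2, 3, 4}
similarity = tanimoto(whole_molecule_a, whole_molecule_b)  # 2 shared / 4 total
```

The same function applies whether the fingerprint covers the whole molecule or only a side chain. Only the bits that are fed in differ.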
Title: Adiabatic semi-empirical parametric method for computing
electronic-vibrational spectra of complex molecules. 1. Polyenes and
diphenylpolyenes

Author(s): Baranov, V.I.; Gribov, L.A.; Djenjer, V.O.; Zelent'sov, D.Yu.
Author Affiliation: Vernadsky Inst. of Geochem. & Anal. Chem., Acad. of
Sci., Moscow, Russia

Journal: Journal of Molecular Structure vol.407, no.2-3 p.177-98
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Language: English Document Type: Journal Paper (JP) 
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Abstract: A parametric semi-empirical method for the calculation of the 
vibrational structure of the electronic spectrum and the determination of 
the parameters of the molecular excited state potential surface has been 
developed. The method is based on the adiabatic molecular model and is 
unique for all sets of parameters of the excited states (first and second 
derivatives of the matrix of coulombic and resonant one-electron integrals 
with respect to the internal coordinates) . Simplified analytical 
expressions for the changes in the molecular potential surfaces on 
excitation, which account only for the first-order terms, are obtained. It 
is shown that the parameters possess distinct local properties and may be 
transferred in a homologous series of molecules. The number of most 
significant parameters, sufficient to describe the molecular model 
adequately and to obtain satisfactory quantitative results, is very small. 
Calculations of geometry changes and vibronic spectra for some polyene and 
diphenylpolyene molecules using only two parameters show good quantitative 
agreement with experimental data. It is possible to create a special
data bank of molecular fragments for vibronic spectroscopy with
relatively small structural groups (e.g. H>C= for polyenes and related
compounds) and to use it to compute the excited state properties of complex
molecules and their vibronic spectra employing the suggested parametric
method. (38 Refs)
Subfile: A 

Descriptors: excited states; polymers; potential energy surfaces; spectra 
; vibrational states 

Identifiers: semiempirical parametric method; electronic-vibrational 
spectra; complex molecules; polyenes; diphenylpolyenes; parametric 
semiempirical method; vibrational structure; electronic spectrum; molecular 
excited state potential surface; adiabatic molecular model; coulombic 
one-electron integrals; resonant one-electron integrals; internal 
coordinates; analytical expressions; molecular potential surfaces; 
molecular model; vibronic spectra; molecular fragments data bank ; 
vibronic spectroscopy; excited state properties 
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of atoms and molecules) 
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Abstract: A new, high-resolution shape-fragment database has been 
developed for computing ab initio quality molecular electron densities 
for polyaromatic hydrocarbons (PAHs) which play a significant role as 
toxicants in the environment. Using the new PAH electron density fragment 
database and the Molecular Electron Density Lego Assembler (MEDLA) 
method, one can generate detailed and reliable electron densities for 
virtually any of the PAH molecules. Accurate electron density shape 
representations for these molecules is essential in the study of detailed 
shape-toxicity correlations. One of our goals is to investigate the 
potential of detailed molecular shape analysis as a predictive tool in 
toxicological risk assessment. In this study we report the results of the 
first phase of the study: the construction and testing of a high quality 
shape-fragment database for PAHs. (45 Refs) 
Abstract: In order to facilitate the three-dimensional structure
comparison of proteins, software for making comparisons and
searching for similarities to protein structures in databases has
been developed. The program identifies the residues that share
similar positions of both main-chain and side-chain atoms between two
proteins. The unique functions of the software also include database
processing via Internet- and Web-based servers for different types of
users. The developed method and its friendly user interface cope with
many of the problems that frequently occur in protein structure
comparisons, such as detecting structurally equivalent residues,
misalignment caused by coincident match of C-alpha atoms, circular
sequence permutations, tedious repetition of access, maintenance of the
most recent database, and inconvenience of user interface. The program
is also designed to cooperate with other tools in structural
bioinformatics, such as the 3DB Browser software [Prilusky (1998),
Protein Data Bank Q. Newslett. 54, 3-4] and the SCOP database [Murzin,
Brenner, Hubbard & Chothia (1995). J. Mol. Biol. 247, 536-540], for
convenient molecular modelling and protein structure analysis. A
similarity ranking score of 'structure diversity' is proposed in order
to estimate the evolutionary distance between proteins based on the
comparisons of their three-dimensional structures. The function of the
program has been utilized as a part of an automated program for
multiple protein structure alignment. In this paper, the algorithm of
the program and results of systematic tests are presented and
discussed.
Abstract: This paper reports a method for the identification of those
molecules in a database of rigid 3D structures with molecular
electrostatic potential (MEP) grids that are most similar to that of a
user-defined target molecule. The most important features of an MEP
grid are encoded in field-graphs, and a target molecule is matched
against a database molecule by a comparison of the
corresponding field-graphs. The matching is effected using a
maximal common subgraph isomorphism algorithm, which provides an
alignment of the target molecule's field-graph with those of each
of the database molecules in turn. These alignments are used in the
second stage of the search algorithm to calculate the intermolecular
MEP similarities. Several different ways of generating field-graphs
are evaluated, in terms of the effectiveness of the resulting
similarity measures and of the associated computational costs. The most
appropriate procedure has been implemented in an operational system
that searches a corporate database, containing ca. 173 000 3D
structures.
Abstract: A new method called the self-optimized prediction method (SOPM)
has been developed to improve the success rate in the prediction of the
secondary structure of proteins. This new method has been checked
against an updated release of the Kabsch and Sander database,
'DATABASE.DSSP', comprising 239 protein chains. The first step of
the SOPM is to build sub-databases of protein sequences and their
known secondary structures drawn from 'DATABASE.DSSP', by (i) making
binary comparisons of all protein sequences and (ii) taking into
account the prediction of structural classes of proteins. The second
step is to submit each protein of the sub-database to a secondary
structure prediction using a predictive algorithm based on sequence
similarity. The third step is to iteratively determine the predictive
parameters that optimize the prediction quality on the whole
sub-database. The last step is to apply the final parameters to the
query sequence. This new method correctly predicts 69% of amino acids
for a three-state description of the secondary structure (alpha helix,
beta sheet and coil) in the whole database (46 011 amino acids). The
correlation coefficients are C-alpha = 0.54, C-beta = 0.50 and C-c =
0.48. Root mean square deviations of 10% in the secondary structure
content are obtained. Implications for the users are drawn so as to
derive an accuracy at the amino acid level and provide the user with a
guide for secondary structure prediction. The SOPM method is available
by anonymous ftp to ibcp.fr.
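The two kinds of figures quoted in this abstract, a three-state accuracy and per-state correlation coefficients, are commonly computed as in the Python sketch below. The abstract does not define its coefficients; the sketch assumes they are Matthews-type correlation coefficients, which is the usual choice in this literature. The function names and the 12-residue example strings (H = helix, E = sheet, C = coil) are hypothetical.

```python
import math


def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def matthews(predicted: str, observed: str, state: str) -> float:
    """Matthews correlation coefficient for one state against all other states."""
    pairs = list(zip(predicted, observed))
    tp = sum(p == state and o == state for p, o in pairs)
    fp = sum(p == state and o != state for p, o in pairs)
    fn = sum(p != state and o == state for p, o in pairs)
    tn = len(pairs) - tp - fp - fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (a convention for this sketch).
    return (tp * tn - fp * fn) / denom if denom else 0.0


observed = "HHHHEEEECCCC"
predicted = "HHHCEEECCCCH"
accuracy = q3(predicted, observed)
helix_corr = matthews(predicted, observed, "H")
```

In this toy case 9 of 12 residues agree, giving an accuracy of 0.75, and the helix coefficient is (3*7 - 1*1) / 32 = 0.625.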
Abstract: A program is described that searches three-dimensional
structural databases, given a user-defined query, in order to
retrieve all structures that contain any combination of a
user-specified minimum number of matching elements. Queries consist
of three-dimensional coordinates of atoms and/or bonds. Numerous query
constraints are described which allow the investigator to define the
chemical nature of the desired structures as well as the environment
within which these structures must reside. They include: (1) Bonded vs.
isolated atom distinction; (2) Atom type designation; (3) Definition of
subsets with occupancy specification (>, =, < X atoms); (4) RMS-fit;
(5) Active site volume accessibility of atoms linking query elements;
(6) Number, atom type, and cyclic structure constraints for atoms
linking pharmacophore elements; (7) Automatic error boundary
adjustment - ad infinitum constraint.

To illustrate the capabilities of this program, queries based on
the crystal structure of a thermolysin-inhibitor complex were tested
against a subset of the Cambridge Crystallographic Database. Several
compounds were returned which satisfied various aspects of the query,
including fitting within the active site. Combination of segments of
compounds which satisfy partial queries should provide a method for
generating unique compounds with affinity for sites of known
three-dimensional structure.
A new approach to applications of the pattern recognition methods in
analytical chemistry. IV. Automatic identification of structural
fragments in organic compounds.

AUTHOR: Hippe, Z. S.; Kerste, A.; Varmuza, K.

CORPORATE SOURCE: Univ. Information Technol. and Management, 35-225 
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PUBLICATION DATE: 2001 (1996200100) LANGUAGE: English 

ABSTRACT: In this paper (part of a sequence devoted to automatic 
identification of organic substructures) a methodology of searching for 
optional classifiers for selected aromatic fragments embedded in 
organic molecules is briefly described. The developed methodology 
uses low-resolution mass spectra and employs computer program SCANKEE 
to create the databases for mass spectra and to search them to 
create spectrum-substructure correlation tables, and finally to convert 
automatically these tables into the rules database which enable 
effective concluding. 
IDENTIFIERS: computer programs - SCANKEE, for pattern recognition based on 
structural fragments, in identn. of organic compounds, by MS ; mass 
spectrometry (MS) - in identn. of organic compounds, computer programs 
for 
SOPM: A self-optimized method for protein secondary structure prediction
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CODEN: PRENE ISSN: 0269-2139 

DOCUMENT TYPE: Journal; Article 

LANGUAGE: ENGLISH SUMMARY LANGUAGE: ENGLISH 

A new method called the self-optimized prediction method (SOPM) has been 
developed to improve the success rate in the prediction of the secondary 
structure of proteins. This new method has been checked against an updated 
release of the Kabsch and Sander database, 'DATABASE.DSSP', comprising 239
protein chains. The first step of the SOPM is to build sub-databases
of protein sequences and their known secondary structures drawn from
'DATABASE.DSSP' by (i) making binary comparisons of all protein sequences
and (ii) taking into account the prediction of structural classes of 
proteins. The second step is to submit each protein of the sub-database to 
a secondary structure prediction using a predictive algorithm based on 
sequence similarity. The third step is to iteratively determine the 
predictive parameters that optimize the prediction quality on the whole 
sub-database. The last step is to apply the final parameters to the query 
sequence. This new method correctly predicts 69% of amino acids for a 
three-state description of the secondary structure (alpha helix, beta sheet 
and coil) in the whole database (46 011 amino acids). The correlation
coefficients are C(alpha) = 0.54, C(beta) = 0.50 and C(c) = 0.48. Root mean
square deviations of 10% in the secondary structure content are obtained. 
Implications for the users are drawn so as to derive an accuracy at the 
amino acid level and provide the user with a guide for secondary structure 
prediction. The SOPM method is available by anonymous ftp to ibcp.fr. 

MEDICAL DESCRIPTORS: 
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algorithm; amino acid sequence; article; comparative study; data base; 
priority journal; sequence analysis; statistical analysis; technique 
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Creation and characterization of a new, non-redundant fragment data bank. 
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Protein engineering (ENGLAND) Jun 1997, 10 (6) p659-64, ISSN 
0269-2139 Journal Code: 8801484 

Document type: Journal Article 
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Main Citation Owner: NLM 

Record type: Completed 
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The success achieved for protein structure prediction of loop regions 
with insertions and deletions by knowledge-based methods depends on the 
quality of the underlying information, i.e. a fragment data bank as
complete as possible is needed. However, the greater the number of proteins
contributing to the data base the more redundant information is included,
which leads to structurally similar proposals in loop predictions and to
longer times for extracting fragments. So it is not only necessary to
increase the number of proteins for building the loop data base but
also to cluster the resulting fragments according to their structural
similarities in order to remove redundancy. Here, a new, non-redundant
fragment data bank is described, which is based on all proteins in the
Brookhaven Protein Data Bank (release 7/95) with a resolution > or = 2.0 A
and which can be updated easily by including new information from
structures to be solved in the future. In the clustering process presented,
the resulting clusters are optimized in several cycles until
self-consistency. In this way all redundant information is removed without
losing any significantly different fragments. Finally the resulting
fragment data bank is analysed with respect to its completeness. 

Descriptors: Computational Biology--methods--MT; *Databases, Factual;
*Peptide Fragments--analysis--AN; Algorithms; Amino Acid Sequence;
Cluster Analysis; Protein Structure, Secondary; Protein Structure, Tertiary;
Sequence Homology, Amino Acid; Structure-Activity Relationship
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Record Date Created: 19971010 
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Chirbase: A molecular database for storage and retrieval of chromatographic 
chiral separations 

AUTHOR: Roussel Christian; Piras Patrick 

AUTHOR ADDRESS: ENSSPICAM, CNRS URA 1410, University Aix-Marseille III, 

13397 Marseille Cedex 13, France** France 
JOURNAL: Pure and Applied Chemistry 65 (2): p235-244 1993 
ISSN: 0033-4545 
DOCUMENT TYPE: Article 
RECORD TYPE: Abstract 
LANGUAGE: English 

ABSTRACT: In order to meet the strong demand for storage and retrieval of
chiral separations, we have developed Chirbase, a database built on
Chembase from Molecular Design Limited, a very powerful and widespread
software package. Chirbase allows the selection of the most promising
conditions for a given chiral separation by searching and retrieving at the
same time molecular fragments issued from the compound and from the
stationary phase.
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Applications — Computational Biology; Methods and Techniques 
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Business Information Services, the Procter & Gamble Company 
At the end of the 20th century we find ourselves with a plethora of 
systems for searching the chemical structures in patents. The Derwent 
fragmentation code for non-polymeric structures has code terms applicable 
from 1963, 1970, 1972, and 1981. The time-ranged fragment coding is 
searched directly in the bibliographic World Patents Index databases 
(DWPI). There are two different chemical fragmentation codes for 
structures in the IFI CLAIMS US patents encoded between 1972 and the 
present. The IFI fragments must be searched for specific registered 
compounds in the CLAIMS Reference file and crossed over to the 
bibliographic UDB and CDB files, where the fragmentation code strategy is
searched again for generic structures and infrequently encountered
molecules. Chemical Abstracts Registry file has topological indexing of
specific compounds, indeed, from patents since 1957...

...to the bibliographic CA and CAOLD files. Topological indexing of patents
published since 1988 is searched directly in the companion MARPAT file.
The Questel Orbit search service offers topological searching with the
Markush DARC system of the Merged Markush Service, which contains indexing
of patents...
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Mapping the Protein Universe 

Holm, Liisa; Sander, Chris 
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Publication Date: 8-02-1996 (960802) Publication Year: 1996 
Document Type: Journal ISSN: 0036-8075 
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Section Heading: Articles 
Word Count: 6817 

(THIS IS THE FULLTEXT) 

...Text: misleading when subtle irregularities in the coordinates lead
to spurious differences in these vectors for proteins that are actually
similar in shape. The algorithm works by storing, in a way convenient for
geometrical lookup, a list of spatial relations between such vectors
taken from database proteins (B8). Here, lookup (or "hashing") is
conceptually similar to looking up names in a telephone book. The lookup
procedure matches the vector relations taken from the query protein
with those in the stored list and proceeds to sample a limited set of
spatial superimpositions whenever enough matches are found between the
query protein and a database protein. Finally, a dynamic programming
step refines these superimpositions and generates detailed residue-level
alignments. The search of one structure against the structure database
of several thousand structures typically takes only about 5 min on a
computer workstation. Other...

...achieve similar speed (B7). In this way, a large portion (about 90%) of
all significant protein-protein shape similarities can be found (Fig.
3A...
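
The lookup ("hashing") step described in the excerpt above can be illustrated
with a minimal Python sketch. It is not the authors' implementation. It
assumes, purely for illustration, that each protein is reduced to a list of
(midpoint, direction) vectors, that a "spatial relation" is the binned
distance and angle between a pair of vectors, and that a database protein
becomes a candidate once it collects min_matches votes. The names
relation_key, build_index, and candidates, and the bin sizes, are
hypothetical. The superimposition sampling and dynamic programming
refinement are not shown.

```python
import math
from collections import defaultdict
from itertools import combinations


def relation_key(v1, v2, dist_bin=2.0, angle_bin=30.0):
    """Quantize the relation between two (midpoint, direction) vectors.

    The key is (distance bin, angle bin); both bin widths are assumed values.
    """
    (m1, d1), (m2, d2) = v1, v2
    dist = math.dist(m1, m2)
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = math.hypot(*d1) * math.hypot(*d2)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    angle = math.degrees(math.acos(cos))
    return (int(dist // dist_bin), int(angle // angle_bin))


def build_index(database):
    """Store every pairwise relation of every database protein in a hash table."""
    index = defaultdict(set)
    for name, vectors in database.items():
        for v1, v2 in combinations(vectors, 2):
            index[relation_key(v1, v2)].add(name)
    return index


def candidates(index, query_vectors, min_matches=2):
    """Look up the query's relations and keep proteins with enough matches."""
    votes = defaultdict(int)
    for v1, v2 in combinations(query_vectors, 2):
        for name in index.get(relation_key(v1, v2), ()):
            votes[name] += 1
    return {name: n for name, n in votes.items() if n >= min_matches}


# Toy data: the query is protA shifted by (10, 10, 10), so its internal
# distances and angles are unchanged.
database = {
    "protA": [((0, 0, 0), (1, 0, 0)), ((5, 0, 0), (0, 1, 0)), ((0, 7, 0), (0, 0, 1))],
    "protB": [((0, 0, 0), (1, 0, 0)), ((20, 0, 0), (1, 0, 0)), ((0, 30, 0), (1, 0, 0))],
}
query = [((10, 10, 10), (1, 0, 0)), ((15, 10, 10), (0, 1, 0)), ((10, 17, 10), (0, 0, 1))]

index = build_index(database)
hits = candidates(index, query)
print(hits)
```

Because the keys depend only on relative distances and angles, the lookup is
unaffected by where the query sits in space. Hard bin edges are a known
weakness of this simple scheme: two nearly identical relations can fall into
neighboring bins, which a practical system would have to tolerate.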
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Language: English Record Type: Fulltext 
Document Type: Magazine/ Journal; General Trade 
Word Count: 1244

measured changes in thousands of proteins in healthy and diseased 
spinal fluids before narrowing the list down to 10 to 20 proteins 
strongly correlated with memory loss. A robotic arm carved out the proteins
of interest, grabbing each sample and breaking it into fragments. It
loaded the fragments into a mass-spec machine for sequencing.

OGS matched those proteins against Incyte's LifeSeq gene
database and three public databases. It found proteins never seen
before in Alzheimer's. Altogether, it took OGS a year to complete the work
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Word Count: 1224 
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TEXT: 

...powerful bioanalysis tools. The database contains information generated
by analyzing more than 1 million gene fragments, representing
approximately 100,000 distinct human genes. This drug-discovery tool will
be used by...

...4.0 incorporates extensive cross-referencing between Incyte sequences 
and GenBank, the repository of public genetic information sponsored by 
the National Center for Biotechnology Information (NCBI). The resulting
LIFESEQ annotations are... 

...DNA Sequence database with the Gene Expression database. This enables 
scientists to manipulate DNA or protein sequence alignments and integrate 
them with the gene-expression profiles of different tissues and cell... 

...Incyte's goal is to make the LIFESEQ database the product of choice for
scientists seeking to analyze and manage both proprietary and public 
genomic data sets. Toward this end, Incyte... 

...Marketing. "Scientists can use LIFESEQ to perform the electronic 
equivalents of biological experiments, such as comparing the 
gene-expression profiles of 'normal' and 'diseased' tissues. Each of these
electronic analyses takes just seconds in the computer, compared with 
weeks of work in a traditional laboratory." What is LIFESEQ? The LIFESEQ 
database is...

...largest and most powerful collections of human genomic data. It provides 
a picture of cellular genetics at a level of detail never before 
possible, helping researchers determine which genes, both known... 

...the way pharmaceutical companies conduct research, develop drugs, and 
even diagnose and treat diseases. In building the LIFESEQ database,
Incyte harnesses the power of high-throughput sequencing to decipher the
structure of DNA (deoxyribonucleic acid), the molecule that makes up our
chromosomes and determines heredity. It then uses sophisticated bioanalysis
software to...

...access to robust sequence-analysis tools such as BLAST, which allows 
researchers to sort and search the data in their quest for promising new 
drug targets. The Gene Expression database contains...

...of "point-and-click" biology. For example, with just a few mouse clicks, 
scientists can compare the genes functioning in healthy prostate tissue 
with those active in prostate cancer. In addition... 

...from Incyte's consultation with our scientists to enhance the product's
integrated approach to genetic database mining for target identification,
confirmation, and validation." Other Database Modules To complement and
expand...

...Incyte is developing new generations of database modules. The Gene 
Mapping module identifies the chromosomal locations for selected gene 
sequences and promises to be a valuable resource in the hunt for... Its
LIFESEQ and Gene Mapping databases integrate bioinformatics software with
both proprietary and publicly available genetic information to create an
information-based tool used by pharmaceutical companies in drug discovery
and...



